Let’s continue with the digits data. We read in the data:
library(tidyverse) # read_csv, dplyr verbs, ggplot2
library(caret)     # createDataPartition, confusionMatrix, knn3
library(MASS)      # lda, qda
url <- "https://raw.githubusercontent.com/datasciencelabs/data/master/hand-written-digits-train.csv"
digits <- read_csv(url)
To simplify the problem we will try to distinguish 2s from 7s. So we subset to only those digits.
dat <- digits %>% filter(label %in% c(2,7))
For illustrative purposes we create two features: X_1, the proportion of non-white pixels in the top left quadrant, and X_2, the proportion of non-white pixels in the bottom right quadrant.
We can create these new predictors like we did before:
dat <- mutate(dat, label = as.character(label)) %>%
  mutate(y = ifelse(label == "2", 0, 1))
# index the pixels in the top left and bottom right quadrants
row_column <- expand.grid(row = 1:28, col = 1:28)
ind1 <- which(row_column$col <= 14 & row_column$row <= 14)
ind2 <- which(row_column$col > 14 & row_column$row > 14)
# call a pixel "non-white" if its intensity is above 200
X <- as.matrix(dat[,-1])
X <- X > 200
X1 <- rowSums(X[,ind1])/rowSums(X)
X2 <- rowSums(X[,ind2])/rowSums(X)
dat <- mutate(dat, X_1 = X1, X_2 = X2)
We can see some examples of the digits from which these predictors are computed:
We act as if we know the truth (the boundary that we should use to distinguish 2s from 7s):
For illustration purposes let’s take a subset of 1,000 handwritten digits:
set.seed(1)
dat <- sample_n(dat, 1000) %>% dplyr::select(y, X_1, X_2)
head(dat)
## # A tibble: 6 × 3
## y X_1 X_2
## <dbl> <dbl> <dbl>
## 1 1 0.238 0.417
## 2 0 0.0610 0.293
## 3 0 0.151 0.315
## 4 0 0.0957 0.255
## 5 0 0.189 0.256
## 6 0 0 0.234
Now create train and test sets:
inTrain <- createDataPartition(y = dat$y, p = 0.5, times = 1, list = FALSE)
train_set <- slice(dat, as.vector(inTrain))
test_set <- slice(dat, -as.vector(inTrain))
Quadratic Discriminant Analysis (QDA) relates to the Naive Bayes approach we described earlier. We try to estimate \(\mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x})\) using Bayes’ theorem:
\[ p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid \mathbf{X}=\mathbf{x}) = \frac{\pi f_{\mathbf{X}|Y=1}(\mathbf{x})} {(1-\pi) f_{\mathbf{X}|Y=0}(\mathbf{x}) + \pi f_{\mathbf{X}|Y=1}(\mathbf{x})} \]
where \(\pi = \mbox{Pr}(Y=1)\) and \(f_{\mathbf{X}|Y=k}\) is the conditional distribution of the predictors in class \(k\).
With QDA, we assume that the conditional distributions \(f_{\mathbf{X}|Y=1}\) and \(f_{\mathbf{X}|Y=0}\) are bivariate normal, which allows the decision boundary to be curved (quadratic). Because we have two predictors, for each case, \(Y=1\) and \(Y=0\), we need to estimate two averages, two standard deviations, and a correlation:
params <- train_set %>%
  group_by(y) %>%
  summarize(avg_1 = mean(X_1), avg_2 = mean(X_2),
            sd_1 = sd(X_1), sd_2 = sd(X_2),
            r = cor(X_1, X_2))
params
## # A tibble: 2 × 6
## y avg_1 avg_2 sd_1 sd_2 r
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.130 0.286 0.0674 0.0592 0.436
## 2 1 0.232 0.292 0.0750 0.107 0.360
We can use these parameter estimates to calculate the QDA estimates
by hand, but we can also perform QDA automatically using the
qda function from the MASS package.
qda_fit <- qda(y ~ X_1 + X_2, data = train_set)
qda_preds <- predict(qda_fit, test_set)
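As a sketch of the by-hand calculation, here is how the parameter estimates above plug into Bayes’ theorem for a single point. The bivariate normal density comes from the mvtnorm package, and the prior \(\pi\) is set to roughly the share of 7s in the data (about 0.48); both the chosen point and the rounded parameter values are for illustration only.

```r
# Sketch: by-hand QDA posterior at one point (x1, x2), using the
# parameter estimates printed above (values rounded)
library(mvtnorm)  # provides dmvnorm, the multivariate normal density

make_cov <- function(sd1, sd2, r) {
  matrix(c(sd1^2, r * sd1 * sd2, r * sd1 * sd2, sd2^2), 2, 2)
}
mu0 <- c(0.130, 0.286); S0 <- make_cov(0.0674, 0.0592, 0.436)  # Y = 0 (2s)
mu1 <- c(0.232, 0.292); S1 <- make_cov(0.0750, 0.107, 0.360)   # Y = 1 (7s)
pi1 <- 0.48  # assumed prior Pr(Y = 1)

x <- c(0.2, 0.3)  # an arbitrary point in predictor space
f0 <- dmvnorm(x, mean = mu0, sigma = S0)
f1 <- dmvnorm(x, mean = mu1, sigma = S1)
p_hat <- pi1 * f1 / ((1 - pi1) * f0 + pi1 * f1)
p_hat  # estimated Pr(Y = 1 | X_1 = 0.2, X_2 = 0.3)
```

The qda function performs this same computation for every observation, using exact rather than rounded estimates.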
This defines the following estimate of \(f(x_1, x_2)\):
This fits the “truth” extremely well. Let’s see how well we perform
on the test set. Note that after calling the predict function,
the class prediction (0 or 1) can be accessed using $class
and the class probabilities using $posterior. We get an overall
accuracy of 0.85 and balanced sensitivity and specificity.
confusionMatrix(data = as.factor(qda_preds$class), reference = as.factor(test_set$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 219 35
## 1 40 206
##
## Accuracy : 0.85
## 95% CI : (0.8156, 0.8802)
## No Information Rate : 0.518
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6998
##
## Mcnemar's Test P-Value : 0.6442
##
## Sensitivity : 0.8456
## Specificity : 0.8548
## Pos Pred Value : 0.8622
## Neg Pred Value : 0.8374
## Prevalence : 0.5180
## Detection Rate : 0.4380
## Detection Prevalence : 0.5080
## Balanced Accuracy : 0.8502
##
## 'Positive' Class : 0
##
Here we have 2 predictors and had to compute 4 means, 4 SDs and 2 correlations. How many parameters would we have if instead of 2 predictors we had 10?
The main problem comes from the correlations: with 10 predictors we have 45 correlations for each class. In general, the formula is \(p(p-1)/2\), which grows quickly with \(p\).
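The count of correlation parameters is just the number of predictor pairs, which we can check directly:

```r
# Correlations per class = number of predictor pairs
p <- 10
p * (p - 1) / 2   # 45
choose(p, 2)      # same count: 45
```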
A solution to this is to assume that the correlation structure is the same for all classes. This reduces the number of parameters we need to estimate. When we do this, we can show mathematically that the solution is “linear”, in the linear algebra sense, and we call it Linear Discriminant Analysis (LDA).
params <- train_set %>%
  group_by(y) %>%
  summarize(avg_1 = mean(X_1),
            avg_2 = mean(X_2),
            sd_1 = sd(X_1),
            sd_2 = sd(X_2),
            r = cor(X_1, X_2))
params <- params %>% mutate(sd_1 = mean(sd_1),
                            sd_2 = mean(sd_2),
                            r = mean(r))
params
## # A tibble: 2 × 6
## y avg_1 avg_2 sd_1 sd_2 r
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.130 0.286 0.0712 0.0831 0.398
## 2 1 0.232 0.292 0.0712 0.0831 0.398
With this simplification we gain computational speed and feasibility,
but we lose accuracy. We can fit an LDA model automatically using the
lda function, which works like the qda function. We
see a dip in overall accuracy to 0.816.
lda_fit <- lda(y ~ X_1 + X_2, data = train_set)
lda_preds <- predict(lda_fit, test_set)
confusionMatrix(data = as.factor(lda_preds$class), reference = as.factor(test_set$y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 207 40
## 1 52 201
##
## Accuracy : 0.816
## 95% CI : (0.7792, 0.849)
## No Information Rate : 0.518
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6322
##
## Mcnemar's Test P-Value : 0.2515
##
## Sensitivity : 0.7992
## Specificity : 0.8340
## Pos Pred Value : 0.8381
## Neg Pred Value : 0.7945
## Prevalence : 0.5180
## Detection Rate : 0.4140
## Detection Prevalence : 0.4940
## Balanced Accuracy : 0.8166
##
## 'Positive' Class : 0
##
This defines the following estimate of \(f(x_1, x_2)\) and the boundary becomes linear:
With the example we have been examining we can make two types of errors: calling a 2 a 7 or calling a 7 a 2. More generally, with binary data we call these false positives (calling a 0 a 1) and false negatives (calling a 1 a 0). Here we have arbitrarily made 7s 1s and 2s 0s.
This concept is important in many areas, in particular in health, where one type of mistake can be much more costly than the other. Note that we have been predicting 1s based on the rule \(\hat{f}(x_1, x_2) > 0.5\), but we can pick a different cutoff depending on the cost of each type of error. For example, if we are predicting whether a plane will malfunction, we want a very low false negative rate and are willing to accept a higher false positive rate.
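To make the trade-off concrete, here is a small sketch on simulated scores (not the digits data) showing how raising the cutoff lowers the false positive rate at the cost of a higher false negative rate:

```r
# Illustration with simulated data: the cutoff trades false negatives
# against false positives
set.seed(1)
y <- rbinom(1000, 1, 0.5)                                      # simulated labels
p_hat <- ifelse(y == 1, rbeta(1000, 4, 2), rbeta(1000, 2, 4))  # simulated scores
for (cutoff in c(0.25, 0.50, 0.75)) {
  pred <- as.numeric(p_hat > cutoff)
  fnr <- mean(pred[y == 1] == 0)   # false negative rate
  fpr <- mean(pred[y == 0] == 1)   # false positive rate
  cat(sprintf("cutoff = %.2f: FNR = %.2f, FPR = %.2f\n", cutoff, fnr, fpr))
}
```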
We can see that the estimated probabilities are on a continuum:
pred_qda <- qda_preds$posterior[,2]
test_set %>% mutate(pred=pred_qda, label=as.factor(y)) %>%
ggplot(aes(label,pred)) + geom_boxplot()
The Receiver Operating Characteristic (ROC) curve plots the true positive rate versus the false positive rate for several choices of cutoff. We can create this curve using the pROC package with the following code:
library(pROC)
roc_qda <- roc(test_set$y, pred_qda)
plot(roc_qda)
We can also create a nicer looking plot using ggroc:
ggroc(list("QDA" = roc_qda)) +
theme(legend.title = element_blank()) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "black", linetype = "dashed") +
xlab("Specificity") +
ylab("Sensitivity")
Here are the results for LDA:
pred_lda <- lda_preds$posterior[,2]
roc_lda <- roc(test_set$y, pred_lda)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ggroc(list("QDA" = roc_qda, "LDA" = roc_lda)) +
theme(legend.title = element_blank()) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "black", linetype = "dashed") +
xlab("Specificity") +
ylab("Sensitivity")
We can also compare to KNN with two different values of
k: 5 and 51.
fit <- knn3(y ~ ., data = mutate(train_set, y = factor(y)), k = 5)
pred_knn_5 <- predict(fit, newdata = test_set)[,2]
roc_knn_5 <- roc(test_set$y, pred_knn_5)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
fit <- knn3(y ~ ., data = mutate(train_set, y = factor(y)), k = 51)
pred_knn_51 <- predict(fit, newdata = test_set)[,2]
roc_knn_51 <- roc(test_set$y, pred_knn_51)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ggroc(list("QDA" = roc_qda, "LDA" = roc_lda, "kNN, k = 5" = roc_knn_5, "kNN, k = 51" = roc_knn_51)) +
theme(legend.title = element_blank()) +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color = "black", linetype = "dashed") +
xlab("Specificity") +
ylab("Sensitivity")
To summarize these curves into one single number that can be compared across methods, it is common to take the area under the ROC curve (abbreviated AUC or AUROC). Higher values indicate better performance.
auc(roc_qda)
## Area under the curve: 0.9226
auc(roc_lda)
## Area under the curve: 0.8934
auc(roc_knn_5)
## Area under the curve: 0.8822
auc(roc_knn_51)
## Area under the curve: 0.9171
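The AUC also has a useful probabilistic interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A sketch with simulated scores (ties ignored, which is fine for continuous scores):

```r
# AUC by hand: compare every positive score to every negative score
set.seed(1)
y <- rbinom(500, 1, 0.5)
score <- y + rnorm(500)   # simulated classifier scores
auc_by_hand <- mean(outer(score[y == 1], score[y == 0], ">"))
auc_by_hand               # for continuous scores this agrees with pROC's auc()
```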
Here we look at an example with three classes. In addition to 2s and 7s, we also consider 1s. Now we need to estimate three conditional probability functions, one per digit, given the values of \(X_1\) and \(X_2\).
We’ll create a training set and test set so we can test out how each method does on the three class problem.
At this point dat holds only the 1,000 sampled 2s and 7s, so we rebuild it from digits with the 1s included and recompute the two predictors:
set.seed(1)
dat <- digits %>% filter(label %in% c(1, 2, 7))
X <- as.matrix(dat[,-1]) > 200
dat <- dat %>% mutate(X_1 = rowSums(X[,ind1])/rowSums(X),
                      X_2 = rowSums(X[,ind2])/rowSums(X))
dat <- sample_n(dat, 3000) %>% dplyr::select(label, X_1, X_2) %>%
  dplyr::mutate(label = as.factor(label))
inTrain <- createDataPartition(y = dat$label, p = 0.5, times = 1, list = FALSE)
train_set <- slice(dat, as.vector(inTrain))
test_set <- slice(dat, -as.vector(inTrain))
train_set %>% ggplot(aes(X_1,X_2,fill=label)) +
geom_point(cex=3, pch=21)
This defines the following estimate of \(f(x_1, x_2)\) and the boundary becomes linear:
fit_lda <- lda(label ~ X_1 + X_2, data = train_set)
This defines the following estimate of \(f(x_1, x_2)\) and the boundary becomes non-linear:
fit_qda <- qda(label ~ X_1 + X_2, data = train_set)
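For each test observation, predict now returns three posterior probabilities, one per digit, that sum to 1; the predicted label is the most probable class. A quick check using the QDA fit above:

```r
# Each row of the QDA posterior matrix gives Pr(label = 1), Pr(label = 2),
# and Pr(label = 7) for one test observation; rows sum to 1
post <- predict(fit_qda, test_set)$posterior
head(post, 3)
all(abs(rowSums(post) - 1) < 1e-8)  # TRUE
```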
Let’s compare the performance of LDA and QDA with logistic
regression. For each label, we will train a model to predict that
number, or not that number. So fit1 predicts 1
vs not 1, fit2 predicts 2 vs
not 2, and fit7 predicts 7 vs
not 7. Then we can also plot these boundaries.
fit1 <- glm(y ~ X_1 + X_2, data = mutate(train_set, y = label == "1"), family = "binomial")
fit2 <- glm(y ~ X_1 + X_2, data = mutate(train_set, y = label == "2"), family = "binomial")
fit7 <- glm(y ~ X_1 + X_2, data = mutate(train_set, y = label == "7"), family = "binomial")
Let’s also use kNN with k = 51 and k = 101
and draw the boundaries.
fit_51 <- knn3(label ~ ., data = train_set, k = 51)
fit_101 <- knn3(label ~ ., data=train_set, k = 101)
After training each model we can predict labels for the test set and then compare the accuracy of each method.
##QDA
pred_qda <- predict(fit_qda, test_set)
##LDA
pred_lda <- predict(fit_lda, test_set)
##GLM
f_hat1 <- predict(fit1, newdata = test_set, type = "response")
f_hat2 <- predict(fit2, newdata = test_set, type = "response")
f_hat7 <- predict(fit7, newdata = test_set, type = "response")
pred_glm <- apply(cbind(f_hat1, f_hat2, f_hat7), 1, which.max)
##KNN 51
f_hat <- predict(fit_51, newdata = test_set)
pred_knn_51 <- apply(f_hat, 1, which.max)
##KNN 101
f_hat <- predict(fit_101, newdata = test_set)
pred_knn_101 <- apply(f_hat, 1, which.max)
Let’s compare:
tab <- table(factor(pred_lda$class, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)
## Confusion Matrix and Statistics
##
##
## 1 2 7
## 1 403 202 28
## 2 50 151 75
## 7 78 109 403
##
## Overall Statistics
##
## Accuracy : 0.6384
## 95% CI : (0.6135, 0.6628)
## No Information Rate : 0.3542
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4528
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 7
## Sensitivity 0.7589 0.3268 0.7964
## Specificity 0.7624 0.8795 0.8117
## Pos Pred Value 0.6367 0.5471 0.6831
## Neg Pred Value 0.8522 0.7457 0.8867
## Prevalence 0.3542 0.3082 0.3376
## Detection Rate 0.2688 0.1007 0.2688
## Detection Prevalence 0.4223 0.1841 0.3936
## Balanced Accuracy 0.7607 0.6031 0.8041
confusionMatrix(tab)$overall[1]
## Accuracy
## 0.6384256
tab <- table(factor(pred_qda$class, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)$overall[1]
## Accuracy
## 0.7431621
tab <- table(factor(pred_glm, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)$overall[1]
## Accuracy
## 0.6124083
tab <- table(factor(pred_knn_51, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)$overall[1]
## Accuracy
## 0.7464977
tab <- table(factor(pred_knn_101, labels=c("1","2","7")), test_set$label)
confusionMatrix(tab)$overall[1]
## Accuracy
## 0.7444963